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Real-time  Motion  Tracking  from  a  Mobile  Robot 
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Abstract — A  mobile  robot  needs  to  perceive  the  motions  of 
external  objects  to  perform  tasks  successfully  in  a  dynamic 
environment.  We  propose  a  set  of  algorithms  for  multiple  motion 
tracking  from  a  mobile  robot  equipped  with  a  monocular  camera 
and  a  laser  rangefinder.  The  key  challenges  are  1.  to  compensate 
the  ego-motion  of  the  robot  for  external  motion  detection,  and  2. 
to  cope  with  transient  and  structural  noise  for  robust  motion 
tracking.  In  our  algorithms,  the  robot  ego-motion  is  directly 
estimated  using  corresponding  feature  sets  in  two  consecutive 
images,  and  the  position  and  velocity  of  a  moving  object  is 
estimated  in  image  space  using  multiple  particle  filters.  The 
estimates  are  fused  with  the  depth  information  from  the  laser 
rangefinder  to  estimate  the  partial  3D  position.  The  proposed 
algorithms  have  been  tested  with  various  configurations  in 
outdoor  environments.  The  algorithms  were  deployed  on  three 
different  platforms;  it  was  shown  that  various  type  of  ego-motion 
were  successfully  eliminated  and  the  particle  filter  was  able  to 
track  motions  robustly.  The  multiple  target  tracking  algorithm 
was  tested  for  different  types  of  motions,  and  it  was  shown  that 
our  multiple  filter  approach  is  effective  and  robust.  The  tracking 
algorithm  was  integrated  with  a  robot  control  loop,  and  its  real¬ 
time  capability  was  demonstrated. 

Index  Terms —  mobile  robot,  motion  tracking,  ego-motion  com¬ 
pensation,  particle  filter. 

I.  Introduction 

MOTION  TRACKING  is  a  fundamental  capability  that 
a  mobile  robot  must  have  in  order  to  operate  in  a 
dynamic  environment.  Moving  objects  (eg.,  people)  are  often 
subjects  for  a  robot  to  interact  with,  and  in  other  contexts 
(eg. ,  traffic)  they  could  be  potentially  more  dangerous  for  safe 
navigation  compared  to  stationary  objects.  Further,  capabilities 
like  localization  and  mapping  critically  depend  on  separating 
moving  objects  from  static  ones.  Finally  motion  is  the  most 
critical  feature  to  track  for  many  surveillance  or  security  appli¬ 
cations.  For  instance,  a  building  monitoring  system  can  watch 
for  a  burglar  at  night  by  detecting  motion,  or  an  autonomous 
rescue  vehicle  can  search  for  victims  of  natural  disaster  by 
sensing  motion.  Clearly,  robust  motion  detection  and  tracking 
are  key  enablers  for  many  mobile  robot  applications. 

The  motion  tracking  problem  from  a  mobile  robot  is  illus¬ 
trated  in  Figure  1.  There  are  multiple  moving  objects  in  the 
vicinity  of  a  mobile  robot.  Measurements  from  sensors  on¬ 
board  the  robot  are  contaminated  with  noise,  and  an  estimation 
process  is  required  to  compute  the  positions  and  velocities 
of  the  moving  objects  in  the  robot’s  local  coordinate  system. 
In  the  variant  of  the  problem  studied  here,  we  require  real¬ 
time  estimates  without  prior  knowledge  about  the  number  of 
moving  objects,  the  motion  model  of  objects,  or  the  structure 
of  the  environment.  As  an  additional  restriction,  a  populated, 
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Fig.  1.  Motion  tracking  from  a  mobile  robot:  The  problem  is  to  estimate 
the  positions  and  velocities  of  target  motions  in  the  robot’s  local  coordinate 
system. 


unstructured  outdoor  environment  is  assumed.  Moving  object 
detection  in  a  structured  indoor  environment  has  been  rela¬ 
tively  well  studied  [1],  [2],  [3].  In  an  indoor  environment, 
moving  objects  are  often  clearly  distinguishable  from  the  rest 
of  the  environment  due  to  distinct  environmental  structures 
(eg.  straight  and  perpendicular  walls).  In  contrast,  an  outdoor 
environment  contains  objects  of  irregular  shapes,  and  it  is 
challenging  to  segment  moving  objects  from  the  background. 
In  addition,  outdoor  environments  contain  diverse  motions 
(varying  speeds,  frequencies  etc.). 

There  are  two  main  challenges  in  the  motion  tracking 
problem.  First,  there  are  two  independent  motions  involved 
-  the  ego-motion  of  the  mobile  robot  and  the  external  motions 
of  moving  objects.  Since  these  two  motions  appear  blended 
in  the  sensor  data,  the  ego-motion  of  the  robot  needs  to  be 
eliminated  so  that  the  remaining  motions,  which  are  due  to 
moving  objects,  can  be  detected.  Second,  there  are  various 
types  of  noise  added  at  various  stages.  For  example,  real 
outdoor  images  are  contaminated  by  various  noise  sources 
including  poor  lighting  conditions,  camera  distortion,  unstruc¬ 
tured  and  changing  shape  of  objects,  etc.  Perfect  ego-motion 
compensation  is  rarely  achievable,  thus  it  adds  another  type 
of  uncertainty  to  the  system.  Some  of  these  noise  terms  are 
transient  and  some  of  them  are  constant  over  time. 

Our  approach  to  the  problem  is  to  design  a  simple  and 
fast  ego-motion  compensation  algorithm  in  the  pre-processing 
stage  for  real-time  performance,  and  to  develop  a  probabilistic 
filter  in  the  post-processing  stage  for  uncertainty  and  noise 
handling.  Since  the  sequence  of  camera  images  contains  rich 
information  of  object  motion,  a  monocular  camera  is  utilized 
for  motion  detection  and  tracking.  A  laser  rangefinder  provides 
depth  information  of  image  pixels  for  partial  3D  position 
estimation.  Figure  2  shows  the  processing  sequence  of  our 
motion  tracking  system.  Frame  differencing,  which  compares 
two  consecutive  images  and  finds  motions  based  on  the 
difference,  is  exploited  for  motion  detection.  However,  when 
the  camera  moves  (eg.  when  it  is  mounted  on  a  mobile  robot), 


Report  Documentation  Page 

Form  Approved 

OMB  No.  0704-0188 

Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
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Fig.  2.  Processing  sequence  for  motion  tracking  from  a  mobile  robot:  A 
monocular  camera  is  utilized  for  motion  detection  and  tracking  in  image  space, 
and  the  result  is  fused  with  laser  scans  for  motion  estimation  in  2.5  dimension. 


straightforward  differencing  is  not  applicable  because  a  big 
difference  is  generated  by  the  motion  of  the  camera  even 
when  nothing  moves  in  the  environment.  Therefore,  the  ego- 
motion  of  the  camera  is  compensated  before  comparing  the 
previous  image  (Image(t— 1))  with  the  current  one  ( Image(t )). 
Assuming  that  the  ego-motion  compensation  is  perfect,  the 
difference  image  would  still  contain  structured  noise  on  the 
boundaries  of  objects  because  of  the  lack  of  depth  information 
from  a  monocular  image.  We  use  a  probabilistic  model  to  filter 
such  noise  out  and  to  perform  robust  detection  and  tracking. 
The  motions  of  moving  objects  are  modeled  using  a  Bayesian 
framework,  and  their  probability  distribution  in  image  space 
is  estimated  by  applying  perception  and  motion  models.  Once 
the  positions  and  velocities  of  moving  objects  are  estimated 
in  2D  image  space,  the  information  is  combined  with  the 
partial  depth  information  from  a  laser  rangefinder  in  order  to 
construct  a  3D  motion  model.  By  projecting  range  values  into 
an  image  space,  the  image  pixels  at  the  same  height  as  the 
laser  rangefinder  will  have  depth  information. 

The  performance  of  the  proposed  system  has  been  analyzed 
in  three  steps.  First,  the  robustness  of  the  motion  tracking 
algorithms  has  been  tested  on  various  robot  platforms  that 
have  unique  characteristics  in  terms  of  their  ego-motions.  The 
experimental  results  show  that  our  motion  tracking  system 
is  able  to  cope  with  various  types  of  ego-motions  and  the 
particle  filter  produces  robust  estimation.  Second,  a  multiple 
particle  filter  approach  for  multiple  motion  tracking  has  been 
validated  using  various  scenarios,  and  the  experimental  results 
show  that  the  multiple  filter  approach  tracks  all  motions 
successfully  for  the  cases  that  a  single  filter  approach  fails. 
Lastly,  the  proposed  tracking  system  has  been  integrated  with 
a  robot  control  loop,  and  its  robustness  and  real-time  capability 
have  been  examined.  The  experiments  demonstrate  that  the 
motion  tracking  system  is  robust  enough  that  the  system  is 
not  disturbed  by  other  moving  objects  once  it  starts  to  track 
an  object. 

The  rest  of  the  paper  is  organized  as  follows.  Section  II 
summarizes  the  related  work  on  this  topic.  The  detailed  ego- 
motion  compensation  algorithm  is  given  in  Section  III,  and  the 
design  of  the  probabilistic  filter  is  explained  in  Section  IV. 
Section  V  describes  how  to  fuse  the  estimation  result  in 
image  space  with  laser  rangefinder  data,  and  Section  VI 
reports  the  experimental  results  and  analyzes  the  performance 
of  the  proposed  algorithms.  The  current  status  and  possible 
improvements  are  discussed  in  Section  VII. 


II.  Related  Work 

The  computer  vision  community  has  proposed  various  meth¬ 
ods  to  stabilize  camera  motions  by  tracking  features  [4],  [5], 
[6]  and  computing  optical  flow  [7],  [8],  [9].  These  approaches 
focus  on  how  to  estimate  the  transformation  ( homography ) 
between  two  image  coordinate  systems.  However,  the  motions 
of  moving  objects  are  typically  not  considered,  which  leads  to 
poor  estimation. 

Other  approaches  that  extend  these  methods  for  motion 
tracking  using  a  pan/tilt  camera  include  those  in  [10],  [11], 
[12].  However,  in  these  cases  the  camera  motion  was  limited  to 
translation  or  rotation.  When  a  camera  is  mounted  on  a  mobile 
robot,  the  main  motion  of  the  camera  is  a  forward/backward 
movement,  which  makes  the  problem  different  from  that  of  a 
pan/tilt  camera. 

There  is  other  research  on  tracking  from  a  mobile  platform 
with  similar  motions.  [13]  tracks  a  single  object  in  forward- 
looking  infrared  (FLIR)  imagery  taken  from  an  airborne, 
moving  platform,  and  [14],  [15]  track  cars  in  front  using  a 
camera  mounted  on  a  vehicle  driven  on  a  paved  road. 

Once  motion  has  been  identified,  objects  in  the  scene 
need  to  be  tracked.  Work  focusing  on  robust  multiple  target 
tracking  using  probabilistic  filters  includes  [2]  which  uses  a 
particle  filter  to  track  people  indoors  (corridors)  using  a  laser 
rangefinder,  and  [16]  which  also  uses  a  particle  filter  to  track 
multiple  objects  using  a  stationary  camera.  A  Kalman  filter 
was  used  in  [17]  to  detect  and  track  human  activity  with  the 
combination  of  a  static  camera  and  a  moving  camera. 

III.  Ego-motion  Compensation 

The  ego-motion  compensation  is  a  coordinate  conversion 
procedure.  Assume  that  a  sensory  data  acquisition  process 
is  as  follows:  (1)  one  set  of  data  D  is  acquired  at  time  t 
when  a  robot  is  located  at  (x,y,a),  (2)  the  robot  (and  the 
sensors)  moves  to  ( x  +  Ax,  y  +  Ay,  a  +  A  a)  for  At,  and 
(3)  another  set  of  data  D'  is  acquired  at  time  t  +  At.  In 
this  case,  the  data  D  and  D'  cannot  be  compared  directly 
because  they  are  captured  in  different  coordinate  systems. 
Therefore,  the  data  D  should  be  compensated  for  the  ego- 
motion  (Ax,  Ay,  Aa),  which  means  that  the  data  D  should  be 
transformed  as  if  it  were  acquired  when  a  robot  was  located 
at  (x  +  Ax,  y  +  Ay,  a  +  Aa).  The  goal  of  the  ego-motion 
compensation  step  is  to  compute  this  transformation  T.  The 
transformation  can  be  estimated  directly  or  indirectly. 

The  indirect  method  is  to  estimate  a  robot  pose  each  time 
using  various  sensors  (eg.  gyroscope,  accelerometer,  odometer 
and/or  GPS),  and  compute  the  ego-motion  (Ax,  Ay,  A  a)  first. 
Once  the  ego-motion  is  computed,  the  transformation  T  can 
be  computed  based  on  the  geometric  properties  of  sensors. 
The  pose  estimation  technique  has  been  well  studied  [18], 
[19],  [20],  and  the  indirect  method  may  be  effective  when 
the  transformation  error  is  linear  in  the  ego-motion  estimation 
error.  However,  when  a  projection  operation  is  involved  in 
the  transformation,  as  in  our  case,  the  indirect  method  is  not 
appropriate  since  a  tiny  angular  error  in  the  motion  estimation 
step  would  induce  a  huge  position  error  after  being  projected 
into  the  data  space.  Therefore,  we  choose  the  direct  method. 
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The  direct  method  is  to  infer  the  transformation  T  by 
corresponding  salient  features  in  the  data  set  D  to  those  in  the 
data  set  D' .  The  feature  selection  and  matching  techniques  for 
various  types  of  sensors  has  been  studied  by  [4],  [7],  [21],  [22], 
[23].  Since  the  transformation  is  estimated  using  the  corre¬ 
sponding  feature  set  directly,  the  quality  of  the  transformation 
relies  on  the  quality  of  selected  features  from  the  data  sets. 
Unfortunately,  in  our  case,  the  quality  of  the  features  is  poor 
due  to  independently  moving  objects  in  image  data.  Therefore, 
a  transformation  model  and  outlier  detection  algorithm  needs 
to  be  designed  so  that  the  estimated  transformation  is  not 
sensitive  to  those  object  motions. 


A.  Feature  Selection  and  Tracking 


We  adopt  the  KLT  (Kanade-Lucas-Tomasi)  feature  tracking 
algorithm  introduced  in  [4],  [7],  [24],  [25]  to  select  and 
correspond  features  between  two  images.  The  KLT  algorithm 
has  become  a  standard  technique  for  feature-based  computer 
vision  algorithms.  For  completeness,  we  describe  the  algo¬ 
rithms  concisely  here. 

Given  two  consecutive  images  (the  previous  image  Jt_1 
and  the  current  /*),  a  set  of  features  is  selected  from  the 
image  Jt_1,  and  a  corresponding  feature  set  is  constructed 
by  tracking  the  same  features  on  the  image  I1 . 

For  feature  selection,  a  small  search  window  runs  over  the 
whole  image  /t_1  to  check  if  the  window  contains  a  reliably 
trackable  feature.  For  each  search  window, 


1)  Compute  the  boundary  information, 


dl(x,y) 

dx 


dl(x,y) 

dy 


(a)  indoor  features 


(b)  outdoor  features 

Fig.  3.  Salient  features  (filled  circles)  selected  for  tracking:  Primarily 
perpendicular  patterns  (eg.  corners)  or  divergent  textures  (eg.  leaves)  are 
selected. 


2)  Compute  the  covariance  matrix  of  the  boundary  pixels 

3)  Compute  two  eigenvalues  (Ai,A2)  of  the  covariance 
matrix 

4)  Select  a  search  window  such  that  min( Ai,  A2)  >  0 
Search  windows  with  two  small  eigenvalues  contain  no  pat¬ 
tern,  and  those  with  one  small  eigenvalue  and  one  big  eigen¬ 
value  contain  unidirectional  patterns,  which  are  not  easy  to 
track.  Only  search  windows  with  two  big  eigenvalues  are 
selected  for  tracking  because  they  contain  a  perpendicular 
pattern  (eg.  corners)  or  divergent  textures  (eg.  leaves)  which 
are  relatively  unique  enough  to  be  tracked.  The  feature  se¬ 
lection  algorithm  runs  on  the  image  /t_1,  and  generates  a 
set  of  features  Ft_1.  Figure  3  shows  the  features  selected 
from  indoor  and  outdoor  images.  In  the  indoor  image,  most 
of  the  selected  features  are  the  corners  of  objects,  like  desks, 
computers,  and  bookshelves.  In  the  outdoor  image,  some 
corners  of  bricks  and  cars,  leaves  and  grass  that  have  complex 
textures  were  selected  as  features. 

Once  the  feature  set  Ft_1  is  selected,  the  features  are 
tracked  on  the  subsequent  image  I1  and  the  set  of  tracked 
features  Ff  is  generated.  For  efficiency,  the  search  range  was 
limited  to  a  small  constant  distance  (assuming  a  bounded  robot 
speed).  Figure  4  shows  the  robustness  of  the  tracking  method. 
Figure  4  (a)  shows  the  features  selected  from  the  image  /t, 
and  Figure  4  (b)  shows  the  same  features  tracked  over  30 
frames  on  the  image  Jt+3°,  which  is  an  image  captured  3 
seconds  later.  The  erroneous  features  on  image  boundaries  are 
eliminated  for  subsequent  processing. 


(b)  tracked  features  at  time  t  +  30 


Fig.  4.  Feature  tracking:  (a)  shows  salient  features  selected,  and  (b)  shows 
the  same  features  tracked  over  30  frames  (3  seconds). 
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B.  Transformation  Estimation 

Once  the  correspondence  F=<Ft~1,Ft>  is  known,  the 
ego-motion  of  the  camera  can  be  estimated  using  a  transfor¬ 
mation  model  and  an  optimization  method.  We  have  studied 
three  different  models:  affine  model,  bilinear  model,  and  a 
pseudo-perspective  model. 


Affine  Model : 
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Pseudo-perspective  Model  : 
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When  the  interval  between  consecutive  images  is  very  small, 
most  ego-motions  of  the  camera  can  be  estimated  using  an 
affine  model,  which  can  cover  translation,  rotation,  shearing, 
and  scaling  motions.  However,  when  the  interval  is  long,  the 
camera  motion  in  the  interval  cannot  be  captured  by  a  simple 
linear  model.  For  example,  when  the  robot  moves  forward, 
the  features  in  the  image  center  move  slower  that  those  near 
the  image  boundary,  which  is  a  projection  operation,  not  a 
simple  scaling.  Therefore,  a  nonlinear  transformation  model 
is  required  for  those  cases.  On  the  other  hand,  an  over-fitting 
problem  may  be  caused  when  a  model  is  highly  nonlinear, 
especially  when  some  of  the  selected  features  are  associated 
with  moving  objects  (outliers).  There  is  clearly  a  trade-off 
between  a  simple,  linear  model  and  a  highly  nonlinear  model, 
and  it  needs  more  empirical  research  for  the  best  selection. 
We  used  a  bilinear  model  for  the  experiments  reported  in  this 
paper. 

When  the  transformation  from  the  image  Jt_1  to  the  image 
I1  is  defined  as  T/_1?  the  cost  function  for  least  square 
optimization  is  defined  as: 


J  = 


1  N 

-Y 

2 

i=  1 


{fi-TU  (fir1)} 


(i) 


where  N  is  the  number  of  features.  The  model  parameters 
for  ego-motion  compensation  are  estimated  by  minimizing  the 
cost.  However,  as  mentioned  before,  some  of  the  features  are 
associated  with  moving  objects,  which  lead  to  the  inference  of 
an  inaccurate  transformation.  Those  features  (outliers)  should 
be  eliminated  from  the  feature  set  before  the  final  transfor¬ 
mation  is  computed.  The  model  parameter  estimation  is  thus 
performed  using  the  following  two-step  procedure: 

1)  compute  the  initial  estimate  To  using  the  full  feature  set 

F. 

2)  partition  the  feature  set  F  into  two  subsets  Fin  and  Fout 
as: 

f /*  e  Fin  if  l/f-ToUiC/U1) 

1  fi  e  Fout  otherwise 


<  e 


(2) 


Fig.  5.  Outlier  feature  detection:  Outliers  are  marked  with  filled  circles,  and 
inliers  are  marked  with  empty  circles. 

3)  re-compute  the  final  estimate  T  using  the  subset  Fin 
only. 

Figure  5  shows  the  partitioned  feature  sets:  Fin  is  marked  with 
empty  circles,  and  Fout  is  marked  with  filled  circles.  Note 
that  all  features  associated  with  the  pedestrian  are  detected  as 
outliers.  It  is  assumed  for  outlier  detection  that  the  portion  of 
moving  objects  in  the  images  is  relatively  smaller  compared 
to  the  background;  the  features  which  do  not  agree  with  the 
main  motion  are  considered  as  outliers.  This  assumption  will 
break  when  the  moving  objects  are  very  close  to  the  camera. 
However,  most  of  the  time,  these  objects  pass  by  the  camera 
in  a  short  period  (leading  to  transient  errors),  and  a  high-level 
probabilistic  filter  is  able  to  deal  with  the  errors  without  total 
failure. 

C.  Frame  Differencing 

The  image  T-1  is  converted  using  the  transformation  model 
before  being  compared  to  the  image  F  in  order  to  eliminate 
the  effect  of  the  camera  ego-motion.  For  each  pixel  (x,y): 

Ic(x,y)  =  it-1(rtt_r1(x,y))  (3) 

Figure  6  (c)  shows  the  compensated  image  of  Figure  6  (a);  the 
translational  and  forward  motions  of  the  camera  were  clearly 
eliminated.  The  valid  region  5ft  of  the  transformed  image  is 
smaller  than  that  of  the  original  image  because  some  pixel 
values  on  the  border  are  not  available  in  the  original  image 
/t_1.  The  invalid  region  in  Figure  6  (c)  is  filled  black.  The 
difference  image  between  two  consecutive  images  is  computed 
using  the  compensated  image: 

1  (x  v)  =  1 1  y ) _  Jt(x’  y))  I  if  (x’  y) e  K  (4) 

d  1 0  otherwise 

Figure  7  compares  the  results  of  two  cases:  frame  differencing 
without  ego-motion  compensation  (Figure  7  (a))  and  with  ego- 
motion  compensation  (Figure  7  (b)).  The  results  show  that 
the  ego-motion  of  the  camera  is  decomposed  and  eliminated 
from  image  sequences.  The  full  description  of  the  frame 
differencing  process  is  given  in  Algorithm  1. 

IV.  Motion  Detection  in  2D  Image  Space 

The  Frame  Differencing  step  in  Figure  2  generates  the 
sequence  of  difference  images,  /§,/],•••  ,7^,  whose  pixel 
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(c)  Compensated  image  of  (a) 


Fig.  6.  Image  Transformation:  (c)  is  the  transformed  image  of  (a)  into  (b)  coordinates.  The  valid  region  of  the  compensated  image  is  smaller  than  that  of 
the  original  image  due  to  the  absence  of  data  on  the  border. 


(a)  difference  without  compensation 


(b)  difference  with  compensation 


Fig.  7.  Results  of  frame  differencing:  (b)  shows  that  the  ego-motion  of  a 
camera  was  decomposed  and  eliminated  from  image  sequences. 


values  represent  the  amount  of  motion  occurred  in  the  position. 
However,  as  described  earlier,  the  difference  images  contain 
two  different  types  of  errors.  There  are  transient  errors  caused 
by  imperfect  ego-motion  compensation,  and  this  type  of  error 
should  be  filtered  out  using  their  temporal  properties.  There  are 
also  persistent  errors  caused  during  data  acquisition.  Since  the 


camera  positions  are  different  when  two  consecutive  images 
are  captured,  it  is  inevitable  that  some  information  is  newly 
introduced  to  the  current  image  1 1  or  some  information  of  the 
previous  image  P_1  is  occluded  in  the  image  P. 

To  deal  with  those  errors,  a  probabilistic  approach  is 
adopted.  The  normalized  pixel  values  in  the  difference  images 
can  be  interpreted  as  the  probability  of  the  existence  of  moving 
objects  in  that  position,  and  the  position  and  size  of  the  moving 
objects  are  estimated  over  time.  This  estimation  process  can 
be  modeled  using  a  Bayesian  formulation.  Let  x*  represent 
the  state  (eg.  the  position  and  velocity)  of  a  moving  object. 

x=[xyxy]T  (5) 

The  posterior  probability  distribution  Pm(x4)  of  the  state  is 
derived  as  follows. 

Pm(V)  =  P(xV°,  /],..•  ,/')  (6) 

=  //  />(/,', x'. /:;••• ./'  b/’ix' 

=  ^P(/^|x*)P(xt|/0-..,/*-1 

=  rfpft  |x‘)  J  P(V|/d0--.  ./'-Ax'  >) 

=  V*  P(Id\ V)  J  Ax'  lx'-1) 

=  rf  P(/rf|x*)  J  P(xt|xt-1)Pm(xt“1)  c/x*-1 


Now  the  posterior  probability  distribution  can  be  updated 
recursively  by  applying  a  perception  model  P(/^|xt)  and  a 
motion  model  P(xt|xt_1)  over  time. 
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Algorithm  1:  Frame  Differencing  with  Ego-motion  Compensation 

Input:  two  consecutive  images  /t_1  and  I* 

Output:  the  difference  image  l\  and  the  ego-motion  transformation 

1t- 1 

1  begin 

2 

select  a  salient  feature  set  Fl  1  from  I 1  1  ; 

3 

track  the  features  on  I1,  and  generate  the  corresponding  feature  set 

Ft  ; 

4 

F 

<  F*-\  F*  >  ; 

5 

if  F  >  the  number  of  parameters  in  a  transformation  model  then 

/*  compute  the  initial  estimate  using  the  full  feature  set  */ 

6 

construct  an  input  matrix  X  using  Fl~x  ; 

7 

construct  a  target  vector  t  using  F*  ; 

8 

T0  <-  ( XTX)~ 1  XTt 

/*  remove  outliers  from  the  feature  set  */ 

9 

Fin  ^  0  ; 

10 

F0ut  <  0  5 

11 

foreach  </t_1,  fl>  in  the  set  F  do 

12 

if  |/t-T0(/‘-1)|  <€then 

13 

|  insert  </*  1 ,  f*  >  into  Fin\ 

14 

else 

15 

|  insert  </t_1,  ff>  into  Fout\ 

16 

end 

17 

end 

/*  compute  the  final  estimate  using  the  inliers  only.  */ 

18 

construct  an  input  matrix  X  using  only  F^ff1  ; 

19 

construct  a  target  vector  t  using  only  Ff  ; 

20 

T*_1  <-  (XTX)-1  XTt 

/*  compensate  ego-motion.  */ 

21 

initialize  Ic  with  all  0  ; 

22 

forall  (x,  y )  in  the  image  Ic  do 

23 

Ic(x,y)^-It-1(Tg1-1(x,y))  ; 

24 

end 

/*  frame  differencing  with  ego-motion  compensation.  */ 

25 

T 

1 

£ 

26 

else 

/*  no  compensation  is  feasible  without  sufficient  features.  */ 

27 

Tft_1  the  identical  transformation  ; 

28 

29 

end 

30 

return  F  and 

31  end 

A.  Bayesian  Filter  Design 

The  derived  equation  6  shows  how  the  sequence  of  dif¬ 
ference  images  and  the  motion  model  of  a  moving  object  are 
integrated  into  the  state  estimation  process.  However,  there  are 
still  three  questions  to  answer:  (1)  how  to  define  a  perception 
model,  and  (2)  how  to  define  a  motion  model,  and  finally 
(3)  how  to  represent  the  posterior  probability  distribution. 

The  perception  model  P(I\  |x*)  captures  the  idea  that  if 
there  is  a  motion  at  position  x  ,  then  the  difference  values  of 
the  pixel  in  that  position  and  its  neighbors  should  be  big.  For 
example,  let  us  assume  a  small  moving  object  that  occupies  a 
single  pixel  pt_1  on  a  camera  image.  When  the  object  moves 
to  its  neighbor  pixel  p*,  then  the  difference  values  of  both 
pixels  p*-1  and  p^  would  be  big.  This  neighborhood  can  be 
modeled  using  a  multi-variate  Gaussian,  and  the  perception 
model  can  be  defined  as 


P{ll  |x4)  = 


r\n\ 

)  = / 

J  o 


4(x)  X 


D-i(X-xt)TEs  1(x-xt) 


dx 

(7) 


when  d  is  the  dimension  of  the  state  x,  and  the  covariance 


matrix  controls  the  range  of  effective  neighborhood. 


Fig.  8.  Bayesian  filter  tracking  with  piecewise  constant  representation:  The 
left  window  shows  the  input  image  and  the  detected  moving  object  (ellipsoid), 
and  the  right  window  shows  the  posterior  probability  distribution  of  the 
moving  object  using  a  10x10  pixel  grid. 


The  motion  model  P(xt  |xt_1)  captures  the  best  guess  about 
the  motion  of  a  moving  object.  Since  no  prior  knowledge  of 
an  object  motion  is  assumed,  a  constant  velocity  model  is  a 
natural  choice.  The  uncertainty  of  an  object  motion  is  modeled 
by  the  covariance  matrix  of  a  multi- variate  Gaussian. 

xl~x  +  At  x  xl~x 

_  y1-1  +  At  x  y ‘-1 

F  —  £t- 1  W 

yt-l 

P(x‘  |  X*"1)  =  ,  1  e-i(*t-M)TSm1(xt-M)  (9) 

V(27T)d|V»| 

The  choice  of  a  representation  for  the  posterior  probability 
distribution  is  important  for  real-time  system  design  because 
the  update  equation  (Equation  6)  contains  an  integral  operation 
and  the  required  computation  is  intensive.  Even  when  the  size 
of  a  camera  image  is  small,  the  state  space  is  sizeable  because 
the  state  is  four-dimensional.  The  most  compact  representation 
is  to  use  a  single  Gaussian,  like  the  Kalman  filter  [26],  [27], 
but  this  approach  is  not  appropriate  because  (1)  the  initial 
state  of  a  moving  object  is  not  given  in  priori,  and  (2)  image 
segmentation  is  avoided  for  real-time  response,  which  makes 
it  hard  to  construct  a  measurement  matrix. 

A  better  approach  is  to  use  a  piecewise  constant  represen¬ 
tation.  By  decomposing  the  state  space  into  an  equally  spaced 
grid,  the  amount  of  computation  can  be  reduced  drastically. 
For  example,  Figure  8  shows  the  position  estimation  result 
using  a  10x10  pixel  grid.  However,  the  computation  was  still 
not  efficient  enough.  In  order  to  achieve  a  real-time  response, 
only  the  position  of  a  moving  object  was  estimated  (x  = 
[x  y]T),  and  a  constant  position  model,  instead  of  a  constant 
velocity  model,  was  utilized  as  the  consequence  of  the  simpler 
state  definition.  In  addition,  the  approximation  quality  of  the 
posterior  probability  distribution  was  sacrificed  by  using  the 
sparse  representation. 

The  most  popular  representation  that  addresses  these  con¬ 
cerns  is  a  sample-based  representation  [28],  [29].  The  amount 
of  computation  is  reduced  by  using  a  small  set  of  weighted 
samples,  but  the  approximation  quality  is  well  preserved 
by  concentrating  samples  on  the  area  whose  probability  is 
high.  Also,  the  number  of  samples  can  change  dynamically 
depending  on  the  shape  of  the  posterior  probability  distribution 
and  available  computer  power.  In  this  paper,  we  adopt  the 
sample-based  representation. 
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B.  Particle  Filter  Design 

The  Particle  filter  [28],  [29]  is  a  simple  but  effective 
algorithm  to  estimate  the  posterior  probability  distribution 
recursively,  which  is  appropriate  for  real-time  applications. 
In  addition,  its  ability  to  perform  multi-modal  tracking  is 
attractive  for  unsegmented  object  detection  and  tracking  from 
camera  images.  An  efficient  variant,  called  the  Adaptive  Par¬ 
ticle  Filter ,  was  introduced  in  [30].  This  changes  the  number 
of  particles  dynamically  for  a  more  efficient  implementation. 
We  implemented  the  Adaptive  Particle  Filter  to  estimate  the 
posterior  probability  distribution  in  Equation  6.  As  described 
in  Section  IV-A,  particle  filters  also  require  two  models  for  the 
estimation  process:  a  perception  model  and  a  motion  model. 

The  perception  model  is  used  to  evaluate  a  particle  and 
compute  its  weight  (or  importance).  Equation  7  provides  a 
generic  form  of  the  perception  model.  However,  the  perception 
model  is  simplified  for  efficiency.  A  step  function  is  used 
instead  of  a  multi-variate  Gaussian,  and  the  evaluation  range 
is  also  limited  to  m  x  m  fixed  area.  The  m  x  m  mask 
should  be  big  enough  (usually  5  x  5)  so  that  salt-and-pepper 
noise  is  eliminated.  The  weight  uj\  of  the  ith  particle  (s\  = 
[  x\  y\  x\  y\  ]T)  is  computed  by 

m/2  m/2 

=  Id(x\-j,y\-k)  (10) 

j=— m/2  k=—m/2 


Fig.  9.  Particle  filter  tracking:  The  position  of  particles  are  represented  by 
small  dots,  and  the  horizontal  bar  on  the  top-left  corner  shows  the  number  of 
particles  being  used. 


[30],  and  (2)  to  cluster  particles  efficiently  as  described  in 
Section  IV-C.  In  order  to  determine  the  proper  number  of 
particles,  a  kd-tree  with  uniform- size  nodes  is  built.  When 
the  size  of  the  tree  is  k,  the  error  bound  is  e,  and  the 
confidence  quantile  is  zis,  the  proper  number  is  computed 


—r~  i3  (12) 

9(k  —  1)  Zl~5j 


as  follows  [30] 

1  2 
2e 

k  —  1 


n  —  ~  Xk—1,1—5 


1  - 


2e 


9(k  —  1) 


+ 


As  shown  in  Equation  10,  only  the  position  information  is 
used  to  evaluate  particles. 

The  motion  model  is  used  to  propagate  a  newly  drawn 
particle  according  to  the  estimated  motion  of  a  moving  object. 
The  motion  model  in  and  Equation  9  describes  how  to  compute 
the  probability  of  the  new  state  x*  when  the  previous  state 
xt_1  is  given.  However,  for  particle  filter  update,  the  motion 
model  should  describe  how  to  draw  a  new  particle  s\  when  the 
previous  particle  s and  its  weight  uj\  are  given.  Therefore, 
the  motion  model  is  defined  as 

x +  At  x  x +  Normal(^fi) 

.  y t-~1  +  At  x  +  N ormali^r) 

a1.  —  n  1  f 

1  x +  Normal  (I;?) 

y -_1  +  Normal 

where  At  is  a  time  interval,  and  and  are  noise  parameters 
for  position  and  velocity  components  respectively.  The  func¬ 
tion  Normal  (a)  generate  a  Gaussian  random  variate  with  zero 
mean  and  the  standard  deviation  a.  As  shown  in  Equation  11, 
the  parameterized  noise  is  added  to  the  constant-velocity 
model  in  order  to  overcome  an  intrinsic  limitation  of  the 
particle  filter,  which  is  that  all  particles  move  in  a  convergence 
direction.  However,  a  dynamic  mixture  of  divergence  and 
convergence  is  required  to  detect  newly  introduced  moving 
objects.  [29]  introduced  a  mixture  model  to  solve  this  problem, 
but  in  the  image  space  the  probability  P(xt\I is  uniform  and 
the  dual  MCL  becomes  random.  Therefore,  we  used  a  simpler, 
but  effective  method  by  adding  inverse-proportional  noise. 

As  an  implementation  issue,  a  multi-dimensional  kd-tree 
is  constructed  during  the  particle  filter  update.  This  serves 
two  purposes:  (1)  to  compute  the  proper  number  of  particles 


Figure  9  shows  the  output  of  the  particle  filter.  The  dots 
represent  the  position  of  particles,  and  the  horizontal  bar 
on  the  top-left  corner  of  the  image  shows  the  number  of 
particles  being  used.  The  final  algorithm  of  the  particle  filter 
is  described  in  Algorithm  2. 

C.  Particle  Clustering 

The  particle  filter  generates  a  set  of  weighted  particles  that 
estimate  the  posterior  probability  distribution  of  a  moving 
object,  but  the  particles  are  not  easy  to  process  in  the  following 
step.  More  intuitive  and  meaningful  data  can  be  extracted  by 
clustering  the  particles.  A  density-based  algorithm  using  a  kd- 
tree  is  introduced  for  efficient  particle  clustering.  The  main 
idea  is  to  convert  a  set  of  weighted  particles  into  a  lower- 
resolution,  uniform-sized  grid.  The  grids  can  be  represented 
using  a  kd-tree  efficiently,  and  all  clustering  operations  are 
performed  using  the  grids  instead  of  particles.  Therefore,  the 
required  computation  is  reduced  drastically.  However,  since 
each  grid  maintains  enough  information  about  the  particles  in 
the  grid,  the  statistics  of  each  cluster  can  be  calculated  without 
any  accuracy  loss.  The  algorithm  consists  of  the  following  four 
steps: 

1)  Tree  Construction:  Given  a  set  of  weighted  particles,  a 
kd-tree  representation  is  constructed.  The  state  space  is  parti¬ 
tioned  into  uniform- sized  grid  cells,  and  only  non-empty  cells 
are  maintained  using  a  kd-tree.  Since  the  Adaptive  Particle 
Filter  requires  the  kd-tree  for  computing  the  proper  number 
of  particles  (as  described  in  Section  IV-B),  this  step  can  be 
combined  with  the  particle  filter  update.  The  information  of 
each  particle  is  not  necessary  anymore  for  the  subsequent 
steps,  but  a  few  extra  statistics  of  each  terminal  node  in  the 
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Algorithm  2:  Adaptive  Particle  Filter  for  Motion  Tracking 


Input:  the  previous  particle  set  St_1 
ego-motion  transformation  Ttt_1 
Output:  a  new  particle  set  S*  and  its  kd-tree  representation  Kf 

S°  <—  a  set  of  uniformly  weighted,  random  particles  with  the  size 
rimax  ? 

begin 

/*  transform  all  particles  to  compensate  an  ego-motion 


and  the 


3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 


14 


15 

16 

17 

18 

19 

20 


21 

22 

23 

24 


25 

26 
27  end 


t- 1 


>  in 


S*-1  do 


foreach  <  s 

|  s*-1 

end 

/*  perform  the  standard  particle  filter  update 

S*  <-  0  ; 

K*  ^  0  ; 

W  <-  0  ; 

repeat 


*/ 


*/ 


Ci 


,t-i 


Ek: =-m/2Id(X'i-j’y'i-k)  > 


.t-1  . 

A 

for  i=2  to  IS*-1 1  do  C*  C*_i 
generate  a  random  number  r  in  the  range  [0, 1)  ; 
select  the  itk  particle  s'  from  St~1  such  that 
Ci  <r  <  Ci+ 1  ; 

/*  propagate  the  particle  using  the  motion  &  sensor  models 

j  «-  -A  Em/2  /o 

a;'  +  At  x  x' 
y'  +  At  x  y\ 

y[ 

Ek=-m,2  Jtd  (xt  ~  o,  yl  -  k) ; 

/*  add  the  new  particle 
add  <  s*,  a;*  >  to  5*  ; 
add  <  sl ,  >  to  Kf  ; 

VU  <-  VU  +  LJ*  ; 
until  |S*|  < 

or  |5't|  <  -  2i  {l  —  9(prrpi)  \/ 9(|A* l-Ty^1-5}  ’ 

/*  normalize  the  weights  of  all  particles 

if  VU  >  0  then 

|  foreach  <  s*,  uj1  >  m  S*  do  u1  /W  ; 

else 


*/ 


N  ormaK^r) 
■ Normal (%) 

■  Normal 

■  Normality ) 


V 


V 


I  S*  a  set  of  uniformly  weighted,  random  particles  with  the 
|  size  rimax  5 

end 

return  S*  and  X* 


tree  need  to  be  computed.  For  each  terminal  node  fc,  the  weight 
Wk,  the  mean  fik,  and  the  covariance  matrix  of  the  node 
are  calculated  using  the  subset  of  particles  that  are  associated 
with  the  node. 

wk  =  Yji  wi 

Mfc  =  Y^iwisi  /  Y^iwi  (13) 

=  Ei  wiisi  -  r)i.si  -  r)T  /  Ei  wi 

2)  Candidate  Selection:  Instead  of  re-constructing  a  pos¬ 
terior  probability  distribution  and  thresholding  the  pdf, \  we 
select  candidate  grid  cells  whose  particle  density  is  bigger 
than  a  threshold  0.  Since  each  particle  has  different  weight, 
the  density  should  take  the  weight  into  account.  Theoretically 
this  means  Wk/volume(k)  should  be  used  as  the  determinant. 
However,  the  number  of  particles  (rik / volume(k))  can  be  used 
alternatively  for  simplicity  assuming  all  particles  have  uniform 
weights. 


Fig.  10.  Particle  clustering:  Two  ellipsoids  represent  the  means  and 
covariance  of  two  particle  clusters. 


3)  Grouping:  Once  the  candidate  nodes  are  selected,  clus¬ 
tering  can  be  done  by  simply  grouping  the  nodes  by  checking 
connectivity  among  nodes.  There  are  various  known  algo¬ 
rithms  for  this  task.  The  connectivity  can  be  defined  using 
the  distance  between  the  mean  vectors  of  two  nodes,  or  can 
be  determined  by  checking  if  a  node  is  a  neighbor  of  another. 

4 )  Statistics  Computation:  For  each  cluster,  the  statistics  of 
the  particles  in  the  cluster  can  be  calculated  by  summing  the 
statistics  of  the  nodes  in  the  cluster  incrementally. 

_  w  /i+Wfc  Uk 
W-\-Wk 

T(m  Mfc)  } 

~~  W  +  Wk 

=  w  +  wk 
= 

(14) 

Figure  10  shows  the  output  of  the  particle  clustering  al¬ 
gorithm.  The  dots  represent  the  position  of  particles,  and  the 
ellipsoid  represents  the  mean  and  covariance  of  each  cluster. 
The  full  description  of  the  clustering  algorithm  is  in  Table  3. 


E 

w 

[i 


D.  Multiple  Particle  Filters  for  Multiple  Motion  Tracking 

The  particle  filter  has  many  advantages  as  described  in  [31], 
[29];  one  of  the  advantages  is  multi-modality.  This  property 
is  attractive  for  multi-target  tracking  because  it  raises  the 
possibility  that  a  single  set  of  particles  can  track  multiple 
objects  in  an  image  sequence.  However,  that  is  true  only  under 
two  conditions: 

1)  The  perception  model  should  be  “bad”  enough  so  that 
particles  converge  slowly,  and  eventually  stay  on  mul¬ 
tiple  objects.  For  example,  imagine  the  case  in  which 
there  are  two  moving  objects  in  the  image  sequence. 
It  is  desired  that  a  single  set  of  particles  is  split  into 
two  groups,  and  each  group  converges  to  and  track  each 
objects  continuously.  Apparently  a  set  of  particles  would 
behave  as  desired  if  two  objects  show  exactly  the  same 
amount  of  motions  (on  average  over  time)  in  the  image 
sequence.  However,  this  assumption  is  not  realistic.  In 
most  cases,  one  would  show  a  “bigger”  motion  than 
the  other  due  to  different  size  and  shape  of  an  object, 
different  distance  to  a  camera,  etc.  The  behavior  of 
particles  is  quite  different  without  the  assumption.  Since 
the  amount  of  motion  is  different,  the  perception  model 
P(I*l  |xt)  in  Equation  6  generates  a  different  value  for 
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Algorithm  3:  Particle  Clustering  for  Post-processing 
Input:  the  kd-tree  representation  K  of  a  particle  set 
Output:  m  Gaussians  (<  m,  Si  >,  •  •  •  ,  <  /im,  >) 


1  begin 


7 

8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 
29 


/*  select  a  set  of  candidate  nodes  C  based  on  their  density  */ 

C  ^  0  ; 

forall  a  terminal  node  k  in  K  do 
the  particles  in  the  node  k  are 

<  si,wi  >,<  S2,W2  >,  •  •  •  ,  <  sn,von  >  ; 

if  wi  /  volume  (fc)  >  6  /*  thresholding  by  weighted 
density  */ 

then 

/*  these  statistics  can  be  computed  when  the  tree  is  built  */ 
wk  ^I2iwi  ’ 

Uk  Si  wisi  /  Si  wi  J 

F/c  Si  -  MX**  -  M)T  /  J2iwi  ’ 

add  <  wk,  /Xfc,  Efc  >  to  C  ; 


end 

end 

/*  group  the  candidate  nodes  using  connectivity 
for  z  =  1  to  |C|  do 

if  Ci  ^  ou_y  group  then  create  a  new  group  ; 
for  j  =  i  +  1  to  \C\  do 
if  Cj  £  any  group  then 

|  if  \pi  —  pj  |  <  p  then  add  c3  to  the  group  Gi 
else  if  Gi  /  Gj  and  \pi  —  ptj  \  <  p  then 


*/ 


|  combine  the  group  Gi  and  the  group  Gj 


end 


end 


end 


*/ 


/*  compute  the  statistics  of  each  group 
foreach  a  group  Gi  do 

Wi  <—  0  ; 

Mi  <—  0  ; 

^  ^0  ; 

forall  a  candidate  <  wk ,  p,k,  >  in  Gi  do 
/  gH  Ui+Wk  Uk  . 

Si  <- 


end 


Mi 


end 

34  end 

35  return  <  pi,  Si 


>,• 


-  wk 


,  <  Mr7 


each  object.  As  a  result,  particles  on  the  smaller  motion 
would  shift  to  the  bigger  motion  after  some  iterations 
even  when  the  size  of  the  two  particle  groups  was  the 
same  in  the  beginning.  This  behavior  is  expected  because 
Equation  6  is  designed  to  estimate  the  position  of  a 
single  object  x.  This  limitation  can  be  overcome  in 
two  ad-hoc  ways:  (1)  By  making  the  perception  model 
less  sensitive  so  that  particles  converge  very  slowly, 
(2)  By  increasing  the  number  of  particles.  The  first 
technique  is  feasible  when  the  convergence  speed  is  not 
important  [32],  [29].  However,  convergence  speed  is  a 
critical  factor  for  motion  tracking.  When  a  new  object  is 
introduced,  a  filter  should  be  able  to  detect  it  and  start 
to  track  it  in  a  reasonable  time.  The  second  technique  is 
not  desirable  since  the  required  computation  increases 
drastically. 

2)  All  objects  should  be  introduced  in  the  beginning  of 
estimation  process.  As  explained  in  Section  IV-A,  par¬ 
ticles  shows  a  convergence  tendency,  and  consequently 


(a)  single  person  at  t  =  23  (b)  three  people  at  t  =  109 


Fig.  11.  Problem  of  a  single  particle  filter:  (a)  shows  that  the  particle  filter 
converges  on  the  person,  and  (b)  shows  that  the  filter  stays  on  the  person 
continuously  even  when  other  people  are  introduced  in  the  image  later. 


converged  particles  do  not  diverge  unless  the  tracked 
object  disappears.  For  example,  Figure  11  shows  a 
problematic  scenario.  The  particle  set  converges  when 
the  person  enters  into  the  field  of  view  of  the  camera 
as  in  Figure  11  (a),  and  it  concentrates  on  the  person 
continuously  even  when  two  more  people  enter  later  as 
in  Figure  11  (b).  This  problem  is  not  trivial  to  solve 
using  a  limited  number  of  particles. 

Therefore,  we  introduce  a  tracking  system  using  multiple 
particle  filters.  The  main  idea  is  to  maintain  an  extra  particle 
filter  for  a  newly  introduced  or  detected  object.  Since  the 
number  of  objects  is  not  known  in  priori,  particle  filters  should 
be  created  and  destroyed  dynamically.  Whenever  the  extra 
particle  filter  converges  on  a  newly  detected  object,  a  new 
particle  filter  is  created  (as  long  as  the  number  of  particle 
filters  is  smaller  than  the  maximum  limit  Nmax).  Similarly, 
whenever  a  particle  filter  diverges  due  to  the  disappearance 
of  a  tracked  object,  it  is  destroyed.  In  order  to  prevent  two 
particle  filters  from  converging  on  the  same  object,  whenever 
a  particle  filter  is  updated,  the  difference  image  is  modified 
for  subsequent  processing  such  that  difference  values  covered 
by  the  filter  are  cleared.  The  detailed  algorithm  is  described 
in  Algorithm  4. 

V.  Position  Estimation  in  3D  Space 

A  monocular  image  provides  rich  information  for  ego- 
motion  compensation  and  motion  tracking  in  2D  image  space. 
However,  a  single  camera  has  limits  on  retrieving  depth 
information,  and  an  additional  sensor  is  required  to  construct 
3D  models  of  moving  objects.  Our  robots  are  equipped  with  a 
laser  rangefinder,  which  provides  depth  information  within  a 
singe  plane.  Given  the  optical  properties  of  a  camera  and  the 
transformation  between  the  camera  and  the  laser  rangefinder, 
distance  information  from  the  laser  rangefinder  can  be  pro¬ 
jected  onto  the  image  coordinates  (Figure  12). 

Given  the  heading  a  and  the  range  r  of  a  scan,  the  projected 
position  (x,  y )  in  the  image  coordinate  system  is  computed  as 
follows: 


X 

w  „  ( 1  tan(o') 

2  X  1/  tan (fh) J 

_  y . 

[|x(l  +(d  ^  x  (r  l))  x  ixta^})  J 

(15) 

where  the  focal  length  of  the  camera  is  /,  the  horizontal 
and  vertical  field  of  view  of  the  camera  are  fh  and  fv,  the 
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Algorithm  4:  Multiple  Particle  Filters  for  Multiple  Motion  Tracking 

Input:  the  previous  particle  filter  set  Pt_1,  the  difference  image  /J, 
and  the  ego-motion  transformation  Tj_1 
Output:  a  new  particle  filter  set  P* 


create  a  particle  filter  po  ; 

P°  <-  fro}  ; 

begin 

/*  update  all  particle  filters  in  the  set  */ 

P*  <-  0  ; 

foreach  a  particle  filter  pi  in  PT  1  do 

update  the  particle  filter  pi  using  the  difference  image  Ic]  ; 
retrieve  the  cluster  information  Ci  from  the  updated  particle 
filter  pi  ; 

if  |  Ci  |  >  0  /*  check  the  convergence  */ 

then 

Pt  <-  pf  u  {Pi}  ; 

pt_x  ^pt-i  _{p.}  . 

forall  a  cluster  <  //,, ,  S,  >  m  Ci  do 


|  Id^  Id~  region 


end 


end 


end 


18 

19 

20 
21 
22 

23 

24 

25 

26 

27 

28 


/*  add  an  extra  particle  filter  if  allowed 
if  IP*-1 1  >  0  then 

select  a  particle  filter  p'  from  P*_1  ; 
reset  the  particle  filter  p'  ; 

P*  <—  P*  u  {pi}  ; 

pt-i  _fr.}  . 

else 

if  |P*|  <  7Vmaa;  then 

create  a  new  particle  filter  p'  ; 
P*  ^  P*  U  frfr  ; 

end 


*/ 


focal  length 


Fig.  12.  Projection  of  laser  scans  onto  the  image  coordinates:  The  range 
scans  from  a  laser  rangefinder  can  be  projected  onto  the  image  coordinate 
system  based  on  the  optical  properties  of  a  camera  and  the  transformation 
between  the  camera  and  the  laser  rangefinder. 


Fig.  13.  Projected  laser  scans:  The  image  pixels  at  the  same  height  as  the 
laser  rangefinder  have  depth  information. 


VI.  Experimental  Results  and  Discussion 


/*  destroy  unused  particle  filters 

29  forall  a  particle  filter  pi  in  Pt_1 

30  |  destroy  the  particle  filter  pi  ; 

31  end 

32  end 

33  return  Pf 


do 


*/ 


height  from  the  laser  rangefinder  to  the  camera  is  d,  and  the 
image  size  is  w  x  h.  This  projection  model  assumes  a  very 
simple  camera  model  (a  pin-hole  camera)  for  fast  computation. 
As  a  result  of  the  projection,  the  image  pixels  at  the  same 
height  as  the  laser  rangefinder  will  have  depth  information 
as  shown  in  Figure  13.  For  ground  robots,  this  partial  3D 
information  can  be  enough  for  safe  navigation  assuming  all 
moving  obstacles  are  on  the  the  same  plane  as  the  robot.  In 
terms  of  moving  object  tracking,  if  the  region  of  a  moving 
objects  in  image  space  and  those  pixels  are  overlapped,  then 
the  distance  between  a  robot  and  the  moving  object  can  be 
estimated. 

This  naive  integration  of  the  2D  motion  estimates  and 
range  scans  from  a  laser  rangefinder  is  a  reasonable  practical 
solution.  However,  using  two  separate  sensors  requires  another 
estimation  problem  potentially,  which  is  the  fusion  of  multiple 
asynchronous  inputs.  A  preferred  route  (not  investigated  here) 
would  be  to  use  stereo  vision  for  depth  information  retrieval. 
If  computational  power  allows  one  can  exploit  the  facts  that 
stereo  (1)  provides  full  depth  information  of  an  image  space, 
and  (2)  a  single  input  source  provides  synchronous  data  and 
better  fusion  result  can  be  expected. 


The  proposed  motion  tracking  system  was  tested  in  various 
scenarios.  First,  the  robustness  of  ego-motion  compensation 
and  motion  tracking  algorithms  was  tested  using  three  different 
robot  platforms.  The  performance  is  analyzed  in  Section  VI- 
A.  Second,  the  multiple-motion  tracking  system  described  in 
Section  IV-D  was  tested  using  various  scenarios.  The  results 
are  presented  in  Section  VI-B.  Finally,  the  tracking  system 
was  integrated  with  an  actual  robot  control  system.  The  result 
is  discussed  in  Section  VI-C. 

A.  Tracking  a  Moving  Object  from  Various  Platforms 

The  ego-motion  of  a  mobile  robot  is  diverse  according  to  its 
actuator  design  and  the  way  the  camera  is  mounted  on  the  plat¬ 
form.  For  example,  a  down-facing  camera  mounted  on  an  UAV 
(Unmaned  Aerial  Vehicle)  would  show  a  different  ego-motion 
from  a  forward-facing  camera  mounted  on  a  walking  robot.  In 
addition,  the  complexity  of  the  ego-motion  increases  through 
the  interaction  with  rough  terrain.  The  tracking  performance 
is  also  affected  by  the  distribution  of  occlusive  obstacles  in 
an  environment.  Therefore,  the  ego-motion  compensation  and 
the  motion  tracking  algorithms  should  be  tested  on  various 
environments  with  a  wide  variety  of  mobile  platforms. 

1)  Experimental  Setup :  The  tracking  algorithms  were  im¬ 
plemented  and  tested  in  various  outdoor  environments  using 
three  different  robot  platforms:  robotic  helicopter,  Segway 
RMP,  and  Pioneer2  AT.  Each  platform  has  unique  character¬ 
istics  in  terms  of  its  ego-motion. 

The  Robotic  Helicopter  [33]  in  Figure  14  (a)  is  an  au¬ 
tonomous  flying  vehicle  carrying  a  monocular  camera  facing 
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(a)  Robotic  Helicopter 


(b)  Segway  RMP 


(c)  Pioneer2  AT 


Fig.  14.  Robot  platforms  for  experiments:  Each  platform  has  unique  characteristics  in  terms  of  its  ego-motion. 


downward.  Once  it  takes  off  and  hovers,  planar  movements  are 
the  main  motion,  and  moving  objects  on  the  ground  stay  at  a 
roughly  constant  distance  from  the  camera  most  of  the  time; 
however,  pitch  and  roll  motions  still  generate  complicate  video 
sequences.  Also,  high-frequency  vibration  of  the  engine  adds 
motion-blur  to  camera  images. 

The  Segway  RMP  in  Figure  14  (b)  is  a  two- wheeled, 
dynamically  stable  robot  with  self-balancing  capability.  It 
works  like  an  inverted  pendulum;  the  wheels  are  driven  in 
the  direction  that  the  upper  part  of  the  robot  is  falling,  which 
means  the  robot  body  pitches  whenever  it  moves.  Especially 
when  the  robot  accelerates  or  decelerates,  the  pitch  angle 
increases  by  a  significant  amount.  Since  all  sensors  are  directly 
mounted  on  the  platform,  the  pitch  motions  prevent  direct 
image  processing.  Therefore,  the  ego-motion  compensation 
step  should  be  able  to  cope  with  not  only  planar  movements 
but  also  pitch  motions. 

The  Pioneer2  AT  in  Figure  14  (c)  is  a  typical  four-wheeled, 
statically  stable  robot.  Since  the  Pioneer2  robot  is  the  only  stat¬ 
ically  stable  platform  among  these  robot  platforms,  we  drove 
the  robot  on  the  most  severe  test  environment.  Figure  17  shows 
the  rocky  terrain  where  the  robot  was  driven.  In  addition,  the 
moving  objects  were  occluded  occasionally  because  of  the 
trees  in  the  environment. 

The  computation  was  performed  on  embedded  computers 
(Pentium  III  1.0  GHz)  on  the  robots.  Low  resolution  (320x240 
pixels)  input  images  were  chosen  for  real-time  response.  The 
maximum  number  of  particles  was  set  to  5,000,  and  the 
minimum  number  of  particles  was  set  to  1000.  Since  the 
algorithm  is  supposed  to  run  in  parallel  with  other  processes 
(eg.  navigation  and  communication),  less  than  70  percent 
of  the  CPU  time  was  dedicated  for  tracking;  the  tracking 
algorithm  was  able  to  process  five  frames  per  second. 

2)  Experimental  Results:  The  performance  of  the  tracking 
algorithm  was  evaluated  by  comparing  with  the  positions 
of  manually  tracked  objects.  For  each  video  sequence,  the 
rectangular  region  of  moving  objects  were  marked  manually 
and  used  as  ground  truth.  Figure  15-17  show  this  evaluation 
process.  The  upper  rows  show  the  input  image  sequence,  and 
the  positions  of  manually-tracked  objects  are  marked  with 
rectangles.  The  lower  rows  show  the  set  of  particles  and  the 
clustering  results.  The  position  of  each  particle  is  marked  with 
dots,  and  the  horizontal  bar  on  the  top-left  corner  of  the  image 
indicates  the  number  of  particles  being  used.  The  clustering 
result  is  represented  using  an  ellipsoid  and  a  line  inside.  The 
ellipsoid  shows  the  mean  and  covariance  of  the  estimated 


object  position,  and  the  line  inside  of  the  ellipsoid  represents 
the  estimated  velocity  vector  of  the  object. 

The  final  evaluation  result  is  shown  in  Table  I.  Frames 
is  the  number  of  image  frames  in  a  video  sequence,  and 
Motions  is  the  number  of  moving  objects.  Detected  is  the 
total  number  of  detected  objects,  and  True  +  and  False  + 
are  the  number  of  correct  detections  and  the  number  of  false- 
positives  respectively.  Detection  Rate  shows  the  percentage 
of  moving  objects  correctly  detected,  and  Avg.  Error  is  the 
average  Euclidean  distance  in  pixels  between  the  ground  truth 
and  the  output  of  tracking  algorithm.  The  average  distance 
error  should  not  be  considered  as  actual  error  measurement 
since  the  tracking  algorithm  does  not  perform  an  explicit 
object  segmentation;  it  may  track  a  part  of  an  object  that 
generates  motion  while  the  ground  truth  always  tracks  the 
whole  objects  even  though  only  part  of  the  object  moves. 

The  Robotic  helicopter  result  shows  that  the  tracking  al¬ 
gorithm  missed  seven  objects,  but  five  of  them  were  the 
cases  when  a  moving  object  was  introduced  and  showed  only 
partially  on  the  boundary  of  the  image  plane.  Once  the  whole 
object  entered  into  the  field  of  view  of  the  camera,  the  tracking 
algorithm  tracked  it  robustly.  For  the  Segway  RMP  result,  the 
detection  rate  was  satisfactory,  but  the  average  distance  error 
was  larger  than  the  others.  The  reason  was  that  the  walking 
person  was  closer  to  the  robot  and  the  tracking  algorithm  often 
detected  the  upper  body  only,  which  caused  a  constant  distance 
error.  The  Pioneer2  AT  result  shows  the  higher  ratio  of  false- 
positives;  however,  as  explained  in  the  previous  section,  the 
terrain  for  the  experiment  was  more  challenging  (rocky)  and 
the  input  images  were  more  blurred  and  unstable.  Overall 
various  types  of  ego-motions  were  successfully  eliminated 
from  input  images,  and  the  particle  filter  was  able  to  track 
motions  robustly  from  diverse  robot  platforms. 

B.  Tracking  Multiple  Moving  Objects 

The  multiple-motion  tracking  system  using  multiple  particle 
filters  was  introduced  in  Section  IV-D.  Since  the  robustness  of 
an  individual  filter  was  analyzed  in  Section  VI-A,  we  focus 
on  analyzing  how  multiple  filters  are  created  and  destroyed 
effectively  when  the  number  of  moving  objects  changes  dy¬ 
namically. 

1)  Experimental  Setup :  The  Segway  RMP  in  Figure  14  (b) 
was  selected  for  the  experiment  because  of  its  complex  ego- 
motion.  The  Segway  RMP  is  a  dynamically  stable  platform, 
and  its  pitching  motions  for  self-balancing  are  combined  into 
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(d)  t  =  13 


(e)  t  =  21 


Fig.  15.  Moving  object  tracking  from  Robotic  helicopter:  The  upper  row  shows  the  input  image  sequence  with  manually-tracked  objects,  and  the  lower  row 
shows  the  particle  filter  outputs  and  clustering  results. 


(d)  t  =  57  (e)  t  =  119  (f)  t  =  195 

Fig.  16.  Moving  object  tracking  from  Segway  RMP:  The  upper  row  shows  the  input  image  sequence  with  manually-tracked  objects,  and  the  lower  row 
shows  the  particle  filter  outputs  and  clustering  results. 


(a)  t  =  57 


(c)  t  =  195 


(b)  t  =  119 


TABLE  I 

Performance  of  moving  object  detection  algorithm 


Platform 

Frames 

Motions 

Detected 

True  + 

False  + 

Detection  Rate 

Avg.  Error 

Robotic  helicopter 

43 

35 

28 

28 

0 

80.00  % 

11.90 

Segway  RMP 

230 

220 

215 

211 

4 

95.90  % 

21.31 

Pioneer2  AT 

195 

172 

158 

146 

12 

84.88  % 

15.87 
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(d)  t  =  35  (e)  t  =  56  (f)  t  =  191 

Fig.  17.  Moving  object  tracking  from  Pioneer2  AT:  The  upper  row  shows  the  input  image  sequence  with  manually-tracked  objects,  and  the  lower  row  shows 
the  particle  filter  outputs  and  clustering  results. 


(a)  t  =  35 


(b)  t  =  56 


(c)  t  =  191 


linear  and  angular  motions.  The  combination  of  these  motions 
causes  complicated  ego-motions  even  when  the  robot  is  driven 
on  a  flat  terrain  or  when  the  robot  stops  in  place.  The  robot  was 
driven  on  the  USC  campus  during  the  daytime  when  there  are 
diverse  activities  in  the  environment  including  walking  people 
and  automobiles. 

The  tracking  performance  is  analyzed  for  three  different 
cases.  As  explained  in  Section  IV-D,  if  one  of  the  following 
two  conditions  is  not  satisfied,  a  single  particle  filter  fails  to 
track  multiple  objects  even  though  it  supports  multi-modality 
in  theory:  (1)  all  objects  should  be  introduced  before  a  particle 
filter  converges,  and  (2)  the  convergence  speed  of  a  particle 
filter  should  be  sacrificed  by  using  a  “bad”  perception  model. 
The  first  two  cases  are  when  one  or  both  conditions  can  not 
be  satisfied.  In  the  first  case,  there  are  three  people  walking 
by,  but  the  people  are  introduced  in  the  input  image  sequence 
one  by  one,  which  violates  the  first  condition.  In  the  second 
case,  there  are  two  groups  of  automobiles  passing  by,  and 
they  are  introduced  sequentially  with  a  big  time  interval 
between  them.  In  addition,  the  automobiles  move  fast  enough 
so  that  the  convergence  speed  of  a  particle  filter  cannot  be 
sacrificed,  which  violates  the  second  condition.  The  results 
for  both  cases  show  how  the  multiple  particle  filter  approach 
overcomes  the  limitation  of  a  single  particle  filter.  The  stability 
of  this  approach  is  also  clear.  In  the  last  case,  it  is  observed 
how  multiple  particle  filters  behave  when  two  people  walk  in 
different  directions  and  intersect  in  the  middle. 

The  computation  was  performed  on  a  Pentium  IV  (2.1  GHz) 
computer,  and  the  image  resolution  was  fixed  to  320x240 
pixels.  The  maximum  number  of  particle  filters  was  fixed 
to  five,  and  for  an  individual  particle  filter,  the  range  of  the 
number  of  particles  was  set  to  (1000  ~  5000).  The  number 


of  frames  processed  per  second  varies  based  on  how  many 
particle  filters  have  been  created,  but  roughly  10  frames  were 
able  to  be  processed. 

2)  Experimental  Results:  The  snapshots  of  the  multiple 
particle  filter  tracking  multiple  moving  objects  are  shown  in 
Figure  18-20.  The  upper  rows  of  the  figures  show  input  image 
sequences  and  manually-tracked  moving  objects  in  the  images. 
The  manually-tracked  objects  are  marked  with  rectangles.  The 
lower  rows  show  particle  filters  and  the  covered  area  (the 
minimum  rectangular  region  enclosing  each  ellipsoid  that  is 
generated  by  the  particle  clustering  algorithm)  by  each  particle 
filter.  Only  converged  particle  filter  is  visualized  on  the  images. 
Each  particle  filter  is  drawn  with  different  colored  dots,  and 
the  covered  areas  are  marked  with  rectangles. 

The  experimental  result  of  the  first  case  is  shown  in  Fig¬ 
ure  18.  The  estimation  process  starts  with  a  single  particle 
filter.  When  the  first  person  enters  into  the  field  of  view  of 
the  camera  as  in  Figure  18  (a),  the  particle  filter  converges 
and  starts  to  track  the  person  as  in  Figure  18  (d),  and  a  new 
particle  filter  is  created  to  explore  the  remained  area.  When 
the  second  person  enters  as  in  Figure  18  (b),  the  new  particle 
filter  converges  and  starts  to  track  the  second  person  as  in 
Figure  18  (e),  and  another  particle  filter  is  created.  This  process 
is  repeated  whenever  a  new  object  is  introduced.  At  the  end 
when  three  people  are  in  the  input  image  as  in  Figure  18  (c), 
the  total  number  of  particle  filters  becomes  four;  three  filters 
for  people  and  one  extra  filter  to  explore. 

Figure  19  shows  the  experimental  result  of  the  second  case. 
The  estimation  process  is  performed  in  the  same  way  with  the 
first  case.  Whenever  a  new  automobile  is  introduced,  a  new 
particle  filter  is  created.  When  the  automobile  leaves  from  the 
field  of  view  of  the  camera,  the  particle  filter  that  tracks  the 
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(d)  t  =  49  (e)  t  =  95  (f)  t  =  132 


Fig.  18.  Tracking  people:  There  are  three  people  walking  by.  The  upper  row  shows  the  input  image  sequence  with  manually-tracked  objects,  and  the  lower 
row  shows  the  particle  filter  outputs  and  clustering  results. 


(d)  t  =  51  (e)  t  =  66  (f )  t  =  85 


Fig.  19.  Tracking  automobiles:  There  are  two  cars  passing  by  followed  by  a  third  car.  The  upper  row  shows  the  input  image  sequence  with  manually-tracked 
objects,  and  the  lower  row  shows  the  particle  filter  outputs  and  clustering  results. 


TABLE  II 

Performance  of  moving  object  detection  algorithm 


Case 

Frames 

Motions 

Detected 

True  + 

False  + 

Detection  Rate 

Avg.  Error 

(1)  Pedestrians 

141 

285 

274 

274 

0 

96.14  % 

8.68 

(2)  Automobiles 

123 

120 

95 

92 

3 

76.67  % 

20.40 

(3)  Intersection 

81 

162 

156 

156 

0 

96.30  % 

12.34 
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(d)  t  =  27  (e)  t  =  38  (f)  t  =  67 


Fig.  20.  Intersectional  particles:  There  are  two  people  intersecting  in  the  image.  The  upper  row  shows  the  input  image  sequence  with  manually-tracked 
objects,  and  the  lower  row  shows  the  particle  filter  outputs  and  clustering  results. 


automobile  diverges  and  is  destroyed  eventually. 

The  experimental  result  of  the  third  case  is  shown  in 
Figure  20.  There  are  two  people  walking  in  different  directions 
as  in  Figure  20  (a),  and  two  particle  filters  are  created  to 
track  them  individually  as  in  Figure  20  (d).  The  pedestrians 
intersect  in  the  middle,  and  keep  walking  in  each  direction. 
Figure  20  (e)  and  (f)  demonstrate  that  the  particle  filters  track 
them  successfully  without  being  confused  by  the  intersection. 

The  detailed  evaluation  result  is  shown  in  Table  II.  There 
were  untracked  motions  in  common;  however,  it  happens  only 
right  after  a  new  object  is  introduced  and  before  a  filter 
converges  on  the  object.  Once  a  particle  filter  converges  and  is 
associated  with  the  object,  it  never  fails  to  track  the  object.  For 
the  second  case,  the  detection  rate  is  lower  than  the  other  two 
cases.  This  is  reasonable  since  the  motion  of  an  automobile 
is  much  faster  than  that  of  a  person,  and  it  took  longer  for  a 
particle  filter  to  converge  on  it.  The  false-positives  observed  in 
the  second  case  are  also  related  to  the  faster  motion.  When  an 
automobile  leaves  from  the  camera  field  of  view,  it  disappears 
quickly  enough  so  that  the  particle  filter  stays  converged  for 
one  or  two  frames.  In  general,  the  multiple  particle  filter 
approach  shows  stable  performance  for  all  cases. 

C.  Close  the  loop:  Following  a  Moving  Object 

The  proposed  tracking  system  is  integrated  with  a  robot 
control  loop,  and  its  robustness  and  real-time  capability  are 
tested.  For  this  experiment,  the  task  of  a  robot  is  to  wait  for 
a  moving  object  to  appear  and  follow  the  object. 

1)  System  Design  and  Implementation:  A  robot  needs 
two  capabilities  to  accomplish  the  task:  motion  detection 
and  tracking,  and  local  navigation.  For  motion  detection  and 
tracking,  the  system  described  in  Section  III-V  is  utilized.  This 


Monitoring 

Layer 


Tracking 

Layer 


Navigation 

Layer 


Fig.  21.  Control  architecture  for  motion  following  system:  The  “motion 
tracker”  box  represents  the  proposed  motion  tracking  system. 


module  takes  the  sequence  of  camera  images  and  the  laser 
range  scans  as  inputs,  and  computes  the  existence  of  moving 
object(s)  and  the  estimation  of  target  positions  in  the  robot’s 
local  coordinates.  For  local  navigation,  VFH+  (Vector  Field 
Histogram  +)  [34]  algorithm  is  implemented.  Internally,  VFH+ 
algorithm  performs  two  tasks:  (1)  it  retrieves  range  scans 
from  a  laser  rangefinder,  and  build  a  local  occupancy  grid 
map  for  obstacle  avoidance,  and  (2)  when  the  estimated  target 
position  is  given,  the  algorithm  generates  both  translational 
and  rotational  motor  commands  for  point-to-point  navigation. 

The  system  architecture  is  presented  in  Figure  21.  The 
implemented  system  was  deployed  on  a  Segway  RMP  robot. 
The  computation  was  performed  on  an  embedded  computer 
(Athlon  1.0  GHz)  on  the  robot,  and  the  image  resolution  was 
fixed  at  320x240  pixels. 

2)  Experimental  Result:  The  snapshots  of  the  robot  fol¬ 
lowing  a  person  are  shown  in  Figure  22.  When  the  person 
entered  into  the  field  of  view  of  the  camera  as  in  Figure  22  (a), 
the  particle  filter  converged  on  him,  and  the  Segway  started 
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to  follow  it.  As  shown  in  Figure  22  (b)-(e)  and  (h)-(i),  there 
were  automobiles,  a  golf  car,  and  pedestrians  passing  by  in  the 
background.  However,  the  tracking  system  was  not  confused 
by  them;  the  Segway  was  able  to  track  the  person  without 
a  single  failure.  When  the  person  stopped  and  stood  still  as 
in  Figure  22  (1),  the  particle  filter  diverged,  which  made  the 
robot  stop.  However,  when  the  person  re-started  to  walk  as  in 
Figure  22  (m),  the  particle  filter  was  able  to  detect  the  motion 
again,  and  the  robot  re-started  to  follow  the  person. 

VII.  Conclusion 

We  have  presented  a  set  of  algorithms  for  multiple  motion 
tracking  from  a  mobile  robot  in  real-time.  There  are  three 
challenges:  1)  Compensation  for  the  robot  ego-motion :  The 
ego-motion  of  the  robot  was  directly  measured  using  corre¬ 
sponding  feature  sets  in  two  consecutive  images  obtained  from 
a  camera  rigidly  attached  to  the  robot.  In  order  to  eliminate 
the  unfavorable  effect  of  a  moving  object  in  the  image  se¬ 
quence,  an  outlier  detection  algorithm  has  been  proposed.  2) 
Transient  and  structural  noise:  An  adaptive  particle  filter  has 
been  designed  for  robust  motion  tracking.  The  position  and 
velocity  of  a  moving  object  were  estimated  by  combining 
the  perception  model  and  the  motion  model  incrementally. 
Also,  the  multiple  target  tracking  system  has  been  designed 
using  multiple  particle  filters.  3)  Sensor  fusion:  The  depth 
information  from  a  laser  rangefinder  was  projected  into  the 
image  space,  and  the  partial  3D  position  information  was 
constructed  in  the  region  of  overlap  between  the  range  data 
and  the  image  data. 

The  proposed  algorithms  have  been  tested  with  various 
configurations  in  outdoor  environments.  First,  the  algorithms 
were  deployed  on  three  different  platforms  ( Robotic  Heli¬ 
copter,  Segway  RMP,  and  Pioneer2  AT),  and  tested  in  different 
environments.  The  experimental  results  showed  that  various 
type  of  ego-motions  were  successfully  eliminated  from  input 
images,  and  the  particle  filter  was  able  to  track  motions 
robustly.  Second,  the  multiple  target  tracking  algorithm  was 
tested  for  different  types  of  motions.  The  experimental  results 
show  that  multiple  particle  filters  are  created  and  destroyed 
dynamically  to  track  multiple  targets  introduced  at  different 
times.  Lastly,  the  tracking  algorithm  was  integrated  with  a 
robot  control  loop  to  test  its  real-time  capability,  and  the  task 
of  following  a  moving  object  was  successfully  accomplished. 

The  proposed  algorithms  are  expected  to  be  utilized  in 
various  application  domains  as  a  key  enabler.  Localization  and 
mapping  problems  have  been  studied  actively  by  the  mobile 
robotics  community,  and  there  are  many  well-developed  tech¬ 
niques  widely  used.  However,  most  techniques  assume  a  static 
environment,  and  their  performance  is  degraded  significantly 
when  there  are  dynamic  objects  in  an  environment.  Our  algo¬ 
rithm  provides  a  robust  method  to  detect  dynamic  objects  in  an 
environment,  and  it  can  be  exploited  in  a  pre-processing  step 
to  filter  out  data  that  are  associated  with  the  dynamic  objects. 
Safe  navigation  is  another  fundamental  problem  in  mobile 
robotics,  and  most  solutions  generate  motion  commands  based 
on  local  positions  of  obstacles.  However,  those  solutions 
become  unreliable  when  obstacles  are  dynamic,  especially 


when  the  obstacles  move  faster  than  the  robot.  In  this  case,  a 
mobile  robot  needs  to  predict  obstacle  position  in  the  near 
future  to  avoid  collision.  The  motion  velocity  (speed  and 
direction)  estimation  capability  of  our  algorithm  could  fulfill 
this  requirement.  Human-robot  interaction  is  active  research 
area  in  service  robot  applications,  and  locating  a  subject  to 
interact  with  is  a  key  problem.  Our  algorithm  is  applicable 
in  this  regard.  Needless  to  say,  surveillance  and  security 
applications  could  make  use  of  our  algorithms  to  detect  and 
track  moving  targets. 
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Fig.  22.  Snapshots  of  a  Segway  RMP  robot  following  a  person 
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